Phase II dose finding studies in clinical drug development are typically
conducted to adequately characterize the dose response relationship of a new 
drug. An important decision is then on the choice of a suitable dose response 
function to support dose selection for the subsequent Phase III studies. In 
this paper we compare different approaches for model selection and model 
averaging using mathematical properties as well as simulations. Accordingly, 
we review and illustrate asymptotic properties of model selection criteria and 
investigate their behavior when changing the sample size but keeping the 
effect size constant. In a large scale simulation study we investigate how 
the various approaches perform in realistically chosen settings. Finally, the 
different methods are illustrated with a recently conducted Phase II dose
finding study in patients with chronic obstructive pulmonary disease.
1. Introduction


A critical decision in pharmaceutical drug development is the selection of an appropriate
dose for confirmatory Phase III clinical trials and potential marketing authorization.
For this purpose, dose finding studies are conducted in Phase II to investigate the dose
response relationship of usually 3-7 active doses in the intended patient population for
a clinically relevant endpoint; see Ting (2006) among many others.

Traditionally, dose response studies were analyzed by treating dose as a categorical
variable in an analysis-of-variance (ANOVA) model. Only in the past 20 years has the use
of regression modeling approaches, where dose is treated as a quantitative variable,
become more popular. We refer to, for example, Bretz et al. (2008) for an overview of both
approaches, and the White Paper of the Pharmaceutical Research and Manufacturers
of America (PhRMA) working group on adaptive dose ranging studies (Bornkamp et al.
(2007)) for a comparison of different ANOVA and regression-based approaches.

If a non-linear regression model is adopted, a natural question is which regression
(i.e. dose response) function to utilize. This becomes even more important in the
regulated context of pharmaceutical drug development, where the employed regression
model should be pre-specified at the design stage. This specification thus takes place at
a time when only limited information is available about the dose response relationship,
resulting in model uncertainty. Several authors (e.g. Thomas (2006); Dragalin et al.
(2007)) argued that a flexible monotonic model, such as a Sigmoid Emax model, can be
used for all practical purposes, as it approximates the commonly observed dose response
shapes well. While generally applicable, this flexible model can sometimes be challenging
to fit with a small number of doses. In addition, while several models might fit the data
similarly well, due to the often sparse data they might still differ on certain estimated
quantities of interest, e.g. the target dose estimate.


The MCP-Mod method (see Bretz et al. (2005); Pinheiro et al. (2014); CHMP (2014))
tries to address the model uncertainty problem by acknowledging it explicitly as part
of the methodology. The main idea is to determine a candidate set of dose response
models at the trial design stage. After completing the trial one either selects a single
dose response function out of the candidate model set or applies model averaging based
on the individual model fits. Thus, the MCP-Mod approach allows one to employ either
model selection or model averaging. Verrier et al. (2014) discussed by means of two real
examples their experiences on how to proceed with model selection and model averaging
using MCP-Mod in practice.

Model selection has the advantage that it results in a single model fit, which eases
the interpretation and communication. But it is also known that selecting a single
model and ignoring the uncertainty resulting from the selection will result in confidence
intervals that are too narrow; see Bornkamp (2015) for a high-level discussion or
Chapter 7 in Claeskens and Hjort (2008) for a mathematical treatment. A partial solution
to this problem is to use model averaging. By acknowledging model uncertainty explicitly
as part of the inference one will typically obtain more adequate (i.e. usually wider)
confidence intervals. There exists empirical evidence that model averaging also improves
the estimation efficiency (see Raftery and Zheng (2003) or Breiman (1996)), even though
these authors did not consider the dose-finding setting in particular.

The purpose of this paper is to investigate and compare different model selection and
model averaging approaches in the context of Phase II dose finding studies. Accordingly,
we introduce in Section 2 a motivating case study to illustrate the various approaches
investigated throughout this paper. Next, we briefly review the mathematical background
of different selection criteria and compare them with respect to some of their asymptotic
properties in Section 3. In Section 4 we describe the results of an extensive simulation
study. We revisit the case study in Section 5 and provide some general conclusions in
Section 6.


2. A Case Study in Chronic Obstructive Pulmonary 
Disease (COPD) 

This example refers to a Phase II clinical study of a new drug in patients with chronic
obstructive pulmonary disease (COPD). The primary endpoint of the study was the forced
expiratory volume in one second (FEV1), measured in liters after 7 days of treatment.
The objective of this study was to determine the dose response relationship and the
target dose that achieves an effect of $\delta$ over placebo. In COPD an improvement
$\delta$ of 0.1-0.14 liters on top of the placebo response is considered
clinically relevant. To this end, four active dose levels (12.5, 25, 50 and 100 mg) were
compared with placebo. Point estimates and standard errors for the treatment groups
resulting from an ANCOVA fit are available from clinicaltrials.gov (NCT00501852).
The original study design was a four-period incomplete block cross-over study; see also
Verkindre et al. (2010). For the purpose of this article we simulated a parallel group
design of 60 patients per group (thus 300 patients in total), so that the point estimates
and standard errors match the reported estimates exactly. Figure 2.1 displays the mean
responses at the five dose levels (including placebo) together with the marginal 95%
confidence intervals.

For our purposes, we assume that five candidate models had been identified at the
design stage to best describe the data after completing the trial. More specifically,
we assume the five dose response functions summarized in Table 2.1, namely the linear,
quadratic, Emax, Sigmoid Emax and ANOVA model; see Section 3 for the notation used
in Table 2.1. The questions at hand are (i) which of these candidate models should be
used for the dose response modeling step, (ii) whether model selection or averaging should
be used, and (iii) which specific information criteria should be employed to perform either
model selection or averaging. We will revisit and analyse this case study in Section 5.
Figure 2.1: Mean responses and marginal 95% confidence intervals for the COPD case
study.


Number  Model         Function $\eta(d,\theta)$                                                          Parameter specifications
1       Linear        $\vartheta_0 + \vartheta_1 d$                                                      $\theta_1 = (0, -1.65/8)$
2       Quadratic     $\vartheta_0 + \vartheta_1 d + \vartheta_2 d^2$                                    $\theta_2 = (0, -1.65/3, 1.65/36)$
3       Emax          $\vartheta_0 + \vartheta_1 \frac{d}{\vartheta_2 + d}$                              $\theta_3 = (0, -1.81, 0.79)$
4       Sigmoid Emax  $\vartheta_0 + \vartheta_1 \frac{d^{\vartheta_3}}{\vartheta_2^{\vartheta_3} + d^{\vartheta_3}}$  $\theta_4 = (0, -1.7, 4, 5)$
5       ANOVA         $\eta(d_i,\theta) = \vartheta_i$, $i = 1, \dots, k$                                $\theta_5 = (0, -1.29, -1.35, -1.42, -1.5, -1.6, -1.63, -1.65, -1.65)$


Table 2.1: The five candidate dose response models utilized in the case study, together
with the parameter specifications used in the simulation study from Section 4.
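
As a reading aid, the four parametric mean functions of Table 2.1 can be written down directly. The following is a minimal sketch, not the implementation used in the paper (which relies on the R package DoseFinding); the function names and the `PARAMS` dictionary are hypothetical labels introduced only for this illustration.

```python
# Illustrative sketch of the parametric candidate models in Table 2.1.
# Function and variable names are assumptions made for this example only.

def linear(d, t0, t1):
    return t0 + t1 * d

def quadratic(d, t0, t1, t2):
    return t0 + t1 * d + t2 * d ** 2

def emax(d, t0, t1, t2):
    return t0 + t1 * d / (t2 + d)

def sigmoid_emax(d, t0, t1, t2, t3):
    return t0 + t1 * d ** t3 / (t2 ** t3 + d ** t3)

# Parameter specifications theta_1, ..., theta_4 as listed in Table 2.1.
PARAMS = {
    "linear": (linear, (0.0, -1.65 / 8)),
    "quadratic": (quadratic, (0.0, -1.65 / 3, 1.65 / 36)),
    "emax": (emax, (0.0, -1.81, 0.79)),
    "sigmoid_emax": (sigmoid_emax, (0.0, -1.7, 4.0, 5.0)),
}

def mean_response(model, dose):
    """Mean response of the named model at the given dose."""
    func, theta = PARAMS[model]
    return func(dose, *theta)

if __name__ == "__main__":
    for name in PARAMS:
        print(name, [round(mean_response(name, d), 3) for d in (0, 2, 4, 6, 8)])
```

Evaluating the functions on the dose range 0 to 8 shows that all four specifications start at a placebo response of 0 and reach an effect of roughly -1.65 within that range, which is what makes the scenarios comparable.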


3. Model Selection and Model Averaging 

We assume $k$ different dose levels $d_1, \dots, d_k$, where often $d_1 = 0$ is the placebo. The
set $\Xi = \{d_1, \dots, d_k\}$ of $k = k(\Xi)$ dose levels is called design throughout this paper.
We further assume that for each dose level $d_i$ we have $n_i$ patients, $i = 1, \dots, k$, where
$N = \sum_{i=1}^{k} n_i$. The individual responses are denoted by

    $y_{11}, \dots, y_{1n_1}, \dots, y_{k1}, \dots, y_{kn_k}$.    (3.1)

Throughout this paper, we assume that the observations in (3.1) are realizations of
random variables defined by

    $Y_{ij} = \eta(d_i, \theta) + \varepsilon_{ij}$, $j = 1, \dots, n_i$, $i = 1, \dots, k$,    (3.2)

where $\varepsilon_{11}, \dots, \varepsilon_{kn_k}$ are independent and normally distributed random variables, i.e.
$\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$. Here, $\eta(d_i, \theta)$ denotes the mean response at dose $d_i$. The competing
dose response mean functions in (3.2) are denoted by $\eta_\ell(d, \theta_\ell)$, $\ell = 1, \dots, L$. For
example, in the case study presented in Section 2 we assumed the $L = 5$ candidate
models $\mathcal{M}_1, \dots, \mathcal{M}_5$, summarized in Table 2.1.
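Model selection criterion $I$    Penalty term $\mathrm{pen}_{I,\ell}$ for model $\mathcal{M}_\ell$
AIC                              $d_{\mathcal{M}_\ell}$
AICc                             $\frac{N d_{\mathcal{M}_\ell}}{N - d_{\mathcal{M}_\ell} - 1}$
BIC                              $0.5 \log(N) d_{\mathcal{M}_\ell}$
BIC$_2$                          $0.5 \{\log(N) d_{\mathcal{M}_\ell} - \log(2\pi) d_{\mathcal{M}_\ell}\}$
TIC                              trace-based penalty built from the estimated matrices $J$ and $K$ (see Section 3.1)


Table 3.1: Five model selection criteria and their corresponding penalty terms investigated
in this paper, where $d_{\mathcal{M}_\ell}$ denotes the dimension of the parameter for model $\mathcal{M}_\ell$.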
In the remainder of this section we give a brief overview of commonly used information
criteria for selecting a model from a given class of competing models. All criteria can
be represented in the form

    $I(\mathcal{M}_\ell) = 2 \max_{\theta_\ell} \log \mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell) - 2\, \mathrm{pen}_{I,\ell}$,    (3.3)

where $\mathcal{L}_N$ denotes the likelihood function and $\mathrm{pen}_{I,\ell}$ a penalty term which differs for the
different models $\mathcal{M}_\ell$ and selection criteria $I$. Table 3.1 summarizes the penalty terms of
different criteria that will be introduced below and investigated in later sections.
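
The general form (3.3) can be sketched in a few lines. The snippet below is an illustration only: it assumes that the maximized log-likelihood and the parameter dimension of a fitted model are already available, and the function names are hypothetical. The TIC is left out because its penalty requires estimated information matrices rather than just the dimension. Note that in the form (3.3) larger values indicate a better model.

```python
import math

def penalty(criterion, dim, n_total):
    """Penalty term pen_{I,l} of Table 3.1 for a model with `dim` parameters."""
    if criterion == "AIC":
        return dim
    if criterion == "AICc":
        return n_total * dim / (n_total - dim - 1)
    if criterion == "BIC":
        return 0.5 * math.log(n_total) * dim
    if criterion == "BIC2":
        return 0.5 * (math.log(n_total) * dim - math.log(2 * math.pi) * dim)
    raise ValueError(f"unknown criterion: {criterion}")

def criterion_value(criterion, max_loglik, dim, n_total):
    """Value I(M_l) = 2 * max log-likelihood - 2 * penalty, as in (3.3)."""
    return 2.0 * max_loglik - 2.0 * penalty(criterion, dim, n_total)

if __name__ == "__main__":
    # Hypothetical fitted model: log-likelihood -50 with 3 parameters, N = 100.
    for name in ("AIC", "AICc", "BIC", "BIC2"):
        print(name, round(criterion_value(name, -50.0, 3, 100), 3))
```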

3.1. Information criteria based on the AIC

AIC-based information criteria are often motivated from an information theoretic
perspective. Let $g(y|d)$ denote the true but unknown density of the response variable $Y$
given the dose $d$. In order to estimate target doses of interest and the dose response
curve, we want to identify a model $\mathcal{M}$ defined by a parametric density $p_{\mathcal{M}}(y|d, \theta_{\mathcal{M}})$ for
the response variable $Y$ from a given class of $L$ parametric models which approximates
the true density $g(y|d)$ best. In order to measure the quality of the approximation we
use the Kullback-Leibler divergence (KL-divergence)

    $KL(p_{\mathcal{M}}, g) = \sum_{i=1}^{k} \frac{n_i}{N} \int \log\left( \frac{g(y|d_i)}{p_{\mathcal{M}}(y|d_i, \theta_{\mathcal{M}})} \right) g(y|d_i)\, dy$.    (3.4)

The KL-divergence serves as a distance measure between densities. It is nonnegative and
equal to zero if $g(y|d) = p_{\mathcal{M}}(y|d, \theta_{\mathcal{M}})$. Based on the KL-divergence a model $\mathcal{M}$ from a
given class of $L$ models, say $\mathcal{M}_1, \dots, \mathcal{M}_L$, is called the best approximating model if its
density (with corresponding optimizing parameter $\theta^*_{\mathcal{M}}$) minimizes the KL-divergence to
the true density $g(y|d)$ compared to the KL-divergence of the other $L - 1$ models.

In practice, the identification of the best approximating model within a set of $L$
candidate models $\mathcal{M}_1, \dots, \mathcal{M}_L$ by minimizing the KL-divergence (3.4) is not possible
because this criterion depends on the unknown true density. However, the divergence and
the parameters corresponding to the best approximation can be estimated from the data
$y_{11}, \dots, y_{1n_1}, \dots, y_{k1}, \dots, y_{kn_k}$. Ignoring the terms that are the same across all models,
minimizing the KL-divergence amounts to maximizing

    $Q_N(\mathcal{M}_\ell) := E\left[ \sum_{i=1}^{k} \frac{n_i}{N} \int \log p_\ell(y|d_i, \hat\theta_\ell)\, g(y|d_i)\, dy \right]$,

where the expectation is taken with respect to the distribution of the maximum likelihood
(ML) estimator $\hat\theta_\ell$ of the parameter $\theta_\ell$ in model $\mathcal{M}_\ell$, $\ell = 1, \dots, L$.

It is known that the empirical estimator of this quantity, the log likelihood
$\frac{1}{N} \max_{\theta_\ell} \log(\mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell))$, is a biased estimator and overestimates $Q_N(\mathcal{M}_\ell)$,
leading to overfitting (cf. Claeskens and Hjort (2008)). A bias corrected estimator
instead is given by

    $Q^*_N(\mathcal{M}_\ell) = \frac{1}{N} \max_{\theta_\ell} \log(\mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell)) - \frac{1}{N} \mathrm{pen}_{I,\ell}$.    (3.5)

Using different estimators for the penalty term thus leads to different model selection
criteria; see Table 3.1. Claeskens and Hjort (2008) discussed under which circumstances
the different penalties lead to an approximately unbiased estimation of $Q_N(\mathcal{M}_\ell)$. In
Appendix A we provide further technical background on asymptotic approximations of
the bias term. Setting $\mathrm{pen}_{I,\ell} = d_{\mathcal{M}_\ell}$ leads to the popular Akaike information criterion
(AIC, Akaike (1974))

    $AIC(\mathcal{M}_\ell) = 2 \max_{\theta_\ell} \log(\mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell)) - 2 d_{\mathcal{M}_\ell}$.

The coefficient 2 is added because of approximation arguments (see among others
Claeskens and Hjort (2008)). Hurvich and Tsai (1989) pointed out that the dimension
$d_{\mathcal{M}_\ell}$ is not a good estimator of the bias for small sample sizes and proposed the
penalty term

    $\frac{N d_{\mathcal{M}_\ell}}{N - d_{\mathcal{M}_\ell} - 1}$,

leading to the corrected AIC (AICc). Also, Takeuchi (1976) suggested a trace-based
penalty term, leading to Takeuchi's or the Trace Information Criterion (TIC),
where $K$ denotes the Fisher information matrix and $J$ the negative inverse of the
expectation of the second derivative of the log likelihood function. Both $K$ and $J$ are
estimated; see Appendix A for details.


3.2. Information criteria based on the BIC 


Roughly speaking, the Bayesian Information Criterion (BIC, Schwarz (1978)) chooses the
most likely model based on the data. More precisely, let $Pr(\mathcal{M}_1), \dots, Pr(\mathcal{M}_L)$ denote
the prior probabilities for the models $\mathcal{M}_1, \dots, \mathcal{M}_L$ and $p_1(\theta_1), \dots, p_L(\theta_L)$ prior
distributions for the corresponding parameters $\theta_1, \dots, \theta_L$, respectively. Using Bayes'
theorem and the observations $y^N = (y_{11}, \dots, y_{1n_1}, \dots, y_{k1}, \dots, y_{kn_k})$ the posterior
probability of model $\mathcal{M}_\ell$ is given by

    $Pr(\mathcal{M}_\ell \mid Y = y^N) = \frac{Pr(\mathcal{M}_\ell)\, \lambda_\ell(y^N)}{\sum_{k=1}^{L} Pr(\mathcal{M}_k)\, \lambda_k(y^N)}$,    (3.6)
where $\lambda_\ell(y^N) := p(y^N \mid \mathcal{M}_\ell) = \int \mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell)\, p_\ell(\theta_\ell)\, d\theta_\ell$, $\ell = 1, \dots, L$, denotes the
marginal likelihood (Wasserman (2000) and Claeskens and Hjort (2008) among others).
Note that the denominator is the same for every model under consideration, so
that we only have to compare the numerators in (3.6) in order to compare the models.
Additionally, if we choose equal prior weights for the models, namely $Pr(\mathcal{M}_\ell) = 1/L$
for $\ell = 1, \dots, L$, it suffices to consider the terms $\lambda_1, \dots, \lambda_L$ for model selection. In this
case, exact Bayesian inference would use $2 \log \lambda_\ell(y^N)$ for model $\mathcal{M}_\ell$ to compare between
different models. For the BIC this value is approximated. Approximating the marginal
likelihoods by a Laplace approximation one obtains (Claeskens and Hjort (2008))

    $\lambda_\ell(y^N) \approx \mathcal{L}_N(\mathcal{M}_\ell, \hat\theta_\ell) \left( \frac{2\pi}{N} \right)^{d_{\mathcal{M}_\ell}/2} |J_{\mathcal{M}_\ell}(\hat\theta_\ell)|^{-1/2}\, p_\ell(\hat\theta_\ell)$.

Therefore, the approximation of $2 \log \lambda_\ell(y^N)$ is given by

    $2 \max_{\theta_\ell} \log(\mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell)) - d_{\mathcal{M}_\ell} \log(N) + d_{\mathcal{M}_\ell} \log(2\pi) - \log(|J_{\mathcal{M}_\ell}(\hat\theta_\ell)|) + 2 \log(p_\ell(\hat\theta_\ell))$.

The penalty term of the BIC only uses the terms of the approximation which converge
to infinity with increasing sample size $N$:

    $BIC(\mathcal{M}_\ell) = 2 \max_{\theta_\ell} \log(\mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell)) - d_{\mathcal{M}_\ell} \log(N)$.    (3.7)

Draper (1995) proposed to add the constant term $\log(2\pi) d_{\mathcal{M}_\ell}$ in (3.7) and we refer to this
modification of the BIC as BIC$_2$.


3.3. Properties 

In this section we investigate two properties for the model selection criteria introduced
so far. First, we discuss consistency as a method to compare different model selection
criteria (Claeskens and Hjort (2008)) and illustrate the theoretical results with a
simulation study. Second, we investigate the behavior of the criteria if the effect size (the
ratio of treatment effect and variability) stays constant, but the sample size changes,
which is of particular importance when designing dose finding studies in pharmaceutical
drug development.


3.3.1. Consistency 

Consistency is a popular way to compare different model selection criteria. Consistency
of an information criterion ensures that it picks the best approximating model (among
the candidate models) with a probability converging to 1 with increasing sample size. In
general, consistency of a model selection criterion of type (3.3) depends on the structure
of the penalty term. If the best approximating model is unique, a sufficient condition
for consistency is that the penalty term is strictly positive and, when divided by the
sample size, converges to zero (see Claeskens and Hjort (2008), pp. 100-101). All model
selection criteria considered in Sections 3.1 and 3.2 fulfill this requirement. However, if
the best approximating model is not unique and there exist several best approximating
models with different complexities (i.e. nested models), criteria with a fixed penalty
(independent of the sample size) will not necessarily select the model with the smallest
number of parameters in the set of best approximating models (see Claeskens and Hjort
(2008), pp. 101-102). Therefore the AIC and the TIC have a tendency to overfit, whereas
the BIC and the BIC$_2$ do not.

Figure 3.1: The probability to select the Sigmoid Emax model if the Sigmoid Emax model
is true. The depicted lines are smoothing splines fitted to the data.

We illustrate this using simulations. For simplicity, we consider a situation with two
candidate models: Emax and Sigmoid Emax; see Table 2.1. The Emax model is nested
within the Sigmoid Emax model when setting $\vartheta_3 = 1$. We assume a fixed design where
patients are equally randomized to one of the doses $d_1 = 0, d_2 = 1, \dots, d_9 = 8$
and consider increasing sample sizes, starting with sample size $N = 150$, increasing to
$N = 150{,}000$. In the first scenario the Sigmoid Emax model is the correct model with
parameter $\theta = (0, -1.81, 0.79, 2)$. As predicted by asymptotic theory the AIC, the BIC,
the BIC$_2$ and the TIC select the right model with probability tending to 1, because there
is a unique best approximating model, namely the true Sigmoid Emax model itself (see
Figure 3.1). Comparing the rates of convergence for the different criteria, we conclude
that AIC and TIC perform better than BIC and BIC$_2$ in this scenario.
Figure 3.2: Probability to select the Sigmoid Emax model if the Emax model is true. The
depicted lines are smoothing splines fitted to the data.


In the second scenario the Emax model is the true model with parameter $\theta = (0, -1.81, 0.79)$.
Both the Emax and the Sigmoid Emax model are closest to the true model with
respect to the KL-divergence, because the Emax model is a special case of the Sigmoid
Emax model with $\vartheta_3 = 1$. As expected the BIC and BIC$_2$ choose the more
complex Sigmoid Emax model with probability tending to 0 (see Figure 3.2). The AIC
and TIC choose the Sigmoid Emax model with probability tending to 15.7%. This
value is the asymptotic probability that the AIC selects the Sigmoid Emax model,
since AIC(Sigmoid Emax) - AIC(Emax) converges in distribution to $\chi^2_1 - 2$ and
$P(\chi^2_1 > 2) = 15.7\%$ (see Claeskens and Hjort (2008), p. 50). Summarizing, both the
AIC and the TIC have a tendency to overfit asymptotically if the best approximating
model is not unique.
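
The limiting value of 15.7% can be checked numerically. The sketch below uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$ for a chi-squared variable with one degree of freedom. As an additional illustration, and as an assumption that is not part of the simulation study, the same chi-squared approximation is applied to the BIC, which prefers a model with one extra parameter when twice the log-likelihood difference exceeds $\log(N)$. The function names are hypothetical.

```python
import math

def chi2_1_tail(x):
    """P(chi^2_1 > x), using P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# AIC: the larger nested model wins if twice the log-likelihood gain exceeds 2.
p_overfit_aic = chi2_1_tail(2.0)

def p_overfit_bic(n_total):
    """Approximate overfitting probability of the BIC (threshold log N)."""
    return chi2_1_tail(math.log(n_total))

if __name__ == "__main__":
    print(f"AIC limit: {p_overfit_aic:.4f}")
    for n_total in (150, 1500, 15000, 150000):
        print(n_total, f"{p_overfit_bic(n_total):.5f}")
```

Under this approximation the AIC overfitting probability does not depend on $N$, whereas the BIC value shrinks as $N$ grows, in line with the behavior described for Figure 3.2.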


3.3.2. Dependence on the sample size 

In clinical practice, the sample size at the design stage is often calculated to ensure
that the standard error of a quantity of interest (typically the treatment effect) is below
a given threshold. From a practical viewpoint it would thus be desirable if a model
selection criterion chooses the same dose response model regardless of the sample size as
long as the standard error around the estimated dose response curve remains the same.
In this context one would expect that a model selection criterion behaves similarly
if there are 100 noisy observations (e.g., with a standard deviation of $\sigma_y = 10$, thus
resulting in a standard error $\sigma_y/\sqrt{100} = 1$) or if there are 10 less noisy observations
(e.g., $\sigma_y = \sqrt{10}$, resulting in the same standard error $\sigma_y/\sqrt{10} = 1$).

Figure 3.3: Probability to select the Sigmoid Emax model if the Sigmoid Emax model
is true and the variance depends on the group sample size. The depicted lines are
smoothing splines fitted to the data.

We investigate the different model selection criteria with respect to this property
using the following scenario. Consider the balanced case with equal group sample size
$n = n_i$ at each of the nine dose levels $0, 1, \dots, 8$. The observations are simulated from the
Sigmoid Emax model with parameter $\theta = (0, -1.81, 0.79, 2)$ and normally distributed
errors with standard deviation $\sigma_n = \sqrt{0.01 n}$. That is, the variance increases with the
sample size. Standard results on maximum likelihood estimation show that the standard
error of all estimators depends on the sample size only through the ratio $\sigma_n/\sqrt{n}$ and
consequently this choice gives a constant standard error across the different sample sizes.
The candidate models are again the Emax and the Sigmoid Emax model and the group
sample size is given by $n = 1, 2, 3, 5, 10, 15, 20, \dots, 50$. We calculate the probability that
the model selection criteria AIC, BIC, BIC$_2$ and TIC select the Sigmoid Emax model
under the assumption that the latter is the true model. The results are displayed in
Figure 3.3.

We observe that the probability to choose the Sigmoid Emax model is nearly constant
for the AIC and TIC, unless the sample size is very small. On the other hand, the
probabilities for the BIC and the BIC$_2$ depend on the sample size, since the probability
to select the Sigmoid Emax model decreases with increasing sample size. Thus the sample
size influences model selection by BIC-type criteria not only through the standard error
but also on its own. This is an important point to take into account when planning studies
using BIC-like criteria.
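
The mechanism behind this observation can be made explicit with a small calculation. The sketch below tabulates, for the group sizes used above, the standard error $\sigma_n/\sqrt{n}$ and the amount by which twice the maximized log-likelihood of the Sigmoid Emax model must exceed that of the Emax model (one additional parameter) before the larger model is selected: 2 for the AIC and $\log(N)$ for the BIC, following from the penalties in Table 3.1. The names `GROUP_SIZES`, `N_DOSES` and `design_row` are hypothetical and introduced only for this illustration.

```python
import math

GROUP_SIZES = [1, 2, 3, 5] + list(range(10, 55, 5))  # n = 1, 2, 3, 5, 10, ..., 50
N_DOSES = 9                                          # dose levels 0, 1, ..., 8

def design_row(n):
    """Standard error and selection thresholds for group sample size n."""
    sigma_n = math.sqrt(0.01 * n)      # error standard deviation used above
    total_n = N_DOSES * n              # total sample size N
    return {
        "n": n,
        "standard_error": sigma_n / math.sqrt(n),
        "aic_threshold": 2.0,                  # 2 * (difference in AIC penalty)
        "bic_threshold": math.log(total_n),    # 2 * (difference in BIC penalty)
    }

rows = [design_row(n) for n in GROUP_SIZES]

if __name__ == "__main__":
    for row in rows:
        print(row)
```

The standard error is the same in every row, while the BIC threshold keeps increasing with $N$; the AIC threshold does not change.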


3.4. Model averaging 


Instead of selecting one model, model averaging can also be considered. From a Bayesian
perspective model averaging arises as soon as a prior distribution supported on a
candidate set of models is used, because the posterior distribution will then also be based on
the same candidate models, weighted by their posterior model probability; see
Wasserman (2000). Non-Bayesian model averaging methods have also been proposed; see
Hjort and Claeskens (2003) for a detailed description. For a given quantity of interest,
say $\mu$, these model averaging estimators are obtained by calculating a weighted average
of the individual estimators of the candidate models $\mathcal{M}_1, \dots, \mathcal{M}_L$. One way of determining
the model weights is to use transformations of the model selection criteria for each
candidate model. More precisely, let $\mu$ denote the parameter of interest (e.g. the effect at
a specific dose level) and $\hat\mu_\ell$ the estimator of $\mu$ using the model $\mathcal{M}_\ell$, $\ell = 1, \dots, L$. Then
the model averaging estimator based on the model selection criterion $I$ is given by

    $\hat\mu_I = \sum_{\ell=1}^{L} w^I_\ell\, \hat\mu_\ell$    (3.8)

with corresponding weights (Hjort and Claeskens (2003); Buckland et al. (1997))

    $w^I_\ell = \frac{\exp(0.5\, I(\mathcal{M}_\ell))}{\sum_{i=1}^{L} \exp(0.5\, I(\mathcal{M}_i))}$.    (3.9)

In Section 4 we will investigate model averaging estimators for each of the information
criteria in Table 3.1. Note that when BIC or BIC$_2$ are used, the resulting model weights
are approximations of the underlying posterior model probabilities in a Bayesian model,
although other criteria, such as the AIC, have a Bayesian interpretation as well; see
Clyde (2000).
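
The weights (3.9) and the estimator (3.8) are straightforward to compute once the criterion values $I(\mathcal{M}_\ell)$ and the per-model estimates are available. The following sketch is illustrative only; the function names are hypothetical. Subtracting the largest criterion value before exponentiating is a numerical safeguard that leaves the weights unchanged, because the common factor cancels in the ratio.

```python
import math

def model_weights(criterion_values):
    """Weights w_l proportional to exp(0.5 * I(M_l)), as in (3.9)."""
    largest = max(criterion_values)
    raw = [math.exp(0.5 * (value - largest)) for value in criterion_values]
    total = sum(raw)
    return [r / total for r in raw]

def model_average(estimates, criterion_values):
    """Model averaging estimator: weighted sum of per-model estimates, as in (3.8)."""
    weights = model_weights(criterion_values)
    return sum(w * est for w, est in zip(weights, estimates))

if __name__ == "__main__":
    # Hypothetical criterion values and estimates for three candidate models.
    values = [-412.3, -410.1, -415.8]
    estimates = [-1.52, -1.61, -1.40]
    print(model_weights(values))
    print(model_average(estimates, values))
```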


3.5. Bootstrap model averaging based on AIC and BIC 

An alternative way to perform model averaging is to use bagging (bootstrap aggregating),
as proposed by Breiman (1996). In the following we investigate two estimators of $\mu$ based
on bootstrap model averaging using either AIC or BIC. The essential idea is to bootstrap
the model selection and use all bootstrap predictions for one final prediction. As different
models might have been selected in each bootstrap resample, this method can also be
considered as a model averaging method.
To be precise, consider the sample $(d_1, y_{11}), \dots, (d_1, y_{1n_1}), \dots, (d_k, y_{k1}), \dots, (d_k, y_{kn_k})$,
where $n_i > 1$ for all $i = 1, \dots, k$. We determine the bootstrap
estimator of $\mu$ based on the AIC and the BIC using $R$ bootstrap samples:

1. Bootstrap step:

Perform a stratified bootstrap on the sample

    $(d_1, y_{11}), \dots, (d_1, y_{1n_1}), \dots, (d_k, y_{k1}), \dots, (d_k, y_{kn_k})$.

That is, for every dose $d_i$, $i = 1, \dots, k$, we select a random sample
$(d_i, y^*_{i1}), \dots, (d_i, y^*_{in_i})$ of size $n_i$ out of $(d_i, y_{i1}), \dots, (d_i, y_{in_i})$
with replacement.

2. Model selection step:

Calculate the AIC and BIC value for every competing model based on the bootstrap
sample. Select the model with the largest AIC and the model with the largest BIC value
and estimate the parameter of interest $\mu$ based on the selected model. The resulting
estimators are denoted by $\hat\mu^*_{AIC}$ and $\hat\mu^*_{BIC}$, respectively.

From the $R$ bootstrap samples the medians of the $R$ different estimators $\hat\mu^*_{AIC}$ and
$\hat\mu^*_{BIC}$ are used as the bootstrap estimators for $\mu$.
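
The two steps above can be sketched as follows. To keep the example self-contained, it deliberately swaps in a simplified, hypothetical candidate set (a constant and a linear mean function fitted by least squares) instead of the non-linear models of Table 2.1, counts only the mean parameters as the model dimension, and takes the mean response at a given dose as the quantity $\mu$. All function names are assumptions made for this illustration.

```python
import math
import random
import statistics

def fit_constant(doses, ys):
    mean = sum(ys) / len(ys)
    return (lambda d: mean), 1                      # predictor, dimension

def fit_linear(doses, ys):
    dbar, ybar = sum(doses) / len(doses), sum(ys) / len(ys)
    sxx = sum((d - dbar) ** 2 for d in doses)
    slope = sum((d - dbar) * (y - ybar) for d, y in zip(doses, ys)) / sxx
    return (lambda d: ybar + slope * (d - dbar)), 2

def gaussian_max_loglik(doses, ys, predict):
    """Maximized normal log-likelihood with the variance profiled out."""
    n = len(ys)
    rss = max(sum((y - predict(d)) ** 2 for d, y in zip(doses, ys)), 1e-12)
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1.0)

def bootstrap_average(groups, target_dose, criterion="AIC", n_boot=200, seed=0):
    """Median over bootstrap resamples of the estimate from the selected model."""
    rng = random.Random(seed)
    n_total = sum(len(values) for values in groups.values())
    estimates = []
    for _ in range(n_boot):
        doses, ys = [], []
        for dose, values in groups.items():         # stratified: resample per dose
            doses += [dose] * len(values)
            ys += rng.choices(values, k=len(values))
        best = None
        for fit in (fit_constant, fit_linear):      # model selection step
            predict, dim = fit(doses, ys)
            pen = dim if criterion == "AIC" else 0.5 * math.log(n_total) * dim
            value = 2.0 * gaussian_max_loglik(doses, ys, predict) - 2.0 * pen
            if best is None or value > best[0]:     # larger value is better, cf. (3.3)
                best = (value, predict(target_dose))
        estimates.append(best[1])
    return statistics.median(estimates)

if __name__ == "__main__":
    data = {0: [0.1, -0.1, 0.05, -0.05], 4: [-0.9, -0.7, -0.85, -0.75],
            8: [-1.7, -1.5, -1.65, -1.55]}          # hypothetical responses per dose
    print(bootstrap_average(data, target_dose=8, criterion="AIC"))
    print(bootstrap_average(data, target_dose=8, criterion="BIC"))
```

Resampling within each dose group keeps the group sizes $n_i$ fixed in every bootstrap sample, which is what the stratified bootstrap step requires.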


4. Simulations 

In this section, we report the results of an extensive simulation to investigate and
compare the different model selection and averaging approaches in scenarios that are realistic
for Phase II dose finding trials. In Section 4.1 we introduce the design of the simulation
study, including its assumptions and scenarios. In Section 4.2 we describe the
performance measurements used to evaluate the different approaches. In Section 4.3 we
summarize the results of the simulation study.


4.1. Design of simulation study 


Following the simulation setup of Bornkamp et al. (2007), we investigate different
constellations of sample sizes, number of active dose levels, and true dose response
models. More precisely, we consider two sample sizes $N = 150$ and $N = 250$ for each of
four different designs $\Xi = \{d_1, \dots, d_k\}$ of $k = k(\Xi)$ active dose levels, assuming either
five ($A = \{0, 2, 4, 6, 8\}$, $k(A) = 5$), seven ($B = \{0, 2, 3, 4, 5, 6, 8\}$, $k(B) = 7$), nine
($C = \{0, 1, 2, 3, 4, 5, 6, 7, 8\}$, $k(C) = 9$) or four ($D = \{0, 2, 4, 8\}$, $k(D) = 4$) active dose
levels. In each case the total sample size $N$ is equally distributed across the different
active dose levels. If $N/k$ is not an integer, we use a rounding procedure provided by
Pukelsheim (2006, p. 307). Further, we assume the five dose response models described
in Section 2 with parameters given in Table 2.1 as true models in the simulation. The
errors in model (3.2) are normally distributed with standard deviation $\sqrt{4.5}$.
Thus, a scenario $S$ is defined by the total sample size $N$, the design $\Xi$
and the model used for generating the data. For example, one scenario is given by
$N = 150$, $\Xi = C$ and the Emax model as data generating model. Summarizing, there
are $\mathcal{S} = 2 \cdot 4 \cdot 5 = 40$ scenarios (two possible sample sizes, four possible designs and five
possible data generating models).

In the first simulation study we exclude the ANOVA model from the list of candidate
models under consideration, focussing on the first four models in Table 2.1. That is,
if the ANOVA model is used for generating the data, no dose response model in the
candidate set can exactly fit the underlying truth, so that in this case we investigate
the behavior under model misspecification. In the second simulation study we will add
the ANOVA model to the candidate models under consideration, thus using all five
models from Table 2.1. Furthermore, we exclude the Sigmoid Emax model from the set
of candidate models in scenarios based on the design $D$, since its parameters are not
estimable under $D$. All results are based on $N_{sim} = 1000$ simulation runs per scenario.
In each simulation run the parameters of the different candidate models are estimated
and the value $I(\mathcal{M}_\ell)$ is calculated for each model selection criterion specified in Table 3.1
and for each dose response model specified in Table 2.1. The bootstrap model averaging
approach was used with $R = 500$ bootstrap simulations for each simulated trial.

All simulations are performed using the R-package DoseFinding (Bornkamp et al. (2013)).


4.2. Measurements of performance 

We use the standardized mean squared error (SMSE) and the averaged standardized
mean squared error (ASMSE) to assess estimation of the dose effects and the target dose
of interest, as well as the proportion of selecting the correct dose response model to
evaluate the performance of the model selection criteria.

For a given scenario $S$ (out of the $\mathcal{S} = 40$ scenarios) let $\eta_{I,j}(\cdot, \hat\theta_{I,j,S})$ denote the
estimated regression model with corresponding estimated model parameters $\hat\theta_{I,j,S}$ which is
selected by a given model selection criterion $I$ in the $j$-th simulation run. Moreover, let
$\eta_S(\cdot, \theta_S)$ denote the data generating dose response model of the scenario $S$. The mean
squared error (MSE) of the treatment effect estimator at dose level $d$ is then given by

    $MSE(d, S) = \frac{1}{N_{sim}} \sum_{j=1}^{N_{sim}} \left( \eta_{I,j}(d, \hat\theta_{I,j,S}) - \eta_S(d, \theta_S) \right)^2$.

The average mean squared error (AMSE) for an arbitrary design $\Xi$ with $k(\Xi)$ different
active dose levels is given by $AMSE(\Xi, S) = \frac{1}{k(\Xi)} \sum_{i=1}^{k(\Xi)} MSE(d_i, S)$. In order to obtain
comparability between the scenarios it is useful to standardize the average mean squared
errors. This is achieved by dividing $AMSE(\Xi, S)$ by the minimal average mean squared
error

    $MMSE(\Xi, S) = \min_{\mathcal{M}} \frac{1}{N_{sim}} \sum_{j=1}^{N_{sim}} \frac{1}{k(\Xi)} \sum_{i=1}^{k(\Xi)} \left( \eta_{\mathcal{M}}(d_i, \hat\theta^{(j)}_{\mathcal{M}}) - \eta_S(d_i, \theta_S) \right)^2$,

where the minimum is taken with respect to all models $\eta_{\mathcal{M}}$ and $\hat\theta^{(j)}_{\mathcal{M}}$ is the maximum
likelihood estimator of model $\mathcal{M}$ in the $j$-th simulation run of scenario $S$. The
standardized MSE (SMSE) of scenario $S$ for a specific selection criterion is then given by

    $SMSE(\Xi, S) = \frac{AMSE(\Xi, S)}{MMSE(\Xi, S)}$    (4.1)

and the averaged standardized MSE (ASMSE), i.e. the SMSE averaged over all simulation
scenarios, is given by

    $ASMSE(\Xi) = \frac{1}{\mathcal{S}} \sum_{s=1}^{\mathcal{S}} SMSE(\Xi, S_s)$.    (4.2)
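
The standardization in (4.1) can be illustrated with a short sketch. It assumes that, for one scenario, the estimated mean responses per dose level have been collected over the simulation runs, both for the selection procedure under study and for each candidate model fitted on its own; the data layout and the function names are hypothetical.

```python
def mse(estimates, truth):
    """Mean squared error of a list of estimates around the true value."""
    return sum((est - truth) ** 2 for est in estimates) / len(estimates)

def amse(estimates_by_dose, truth_by_dose):
    """Average of the per-dose MSEs over all dose levels of the design."""
    per_dose = [mse(estimates_by_dose[d], truth_by_dose[d]) for d in truth_by_dose]
    return sum(per_dose) / len(per_dose)

def smse(amse_of_procedure, amse_by_model):
    """AMSE of the procedure divided by the smallest AMSE of any single model."""
    return amse_of_procedure / min(amse_by_model.values())

if __name__ == "__main__":
    truth = {0: 0.0, 8: -1.65}                              # hypothetical true means
    selected = {0: [0.1, -0.1], 8: [-1.45, -1.85]}          # two simulation runs
    single_models = {
        "linear": amse({0: [0.05, -0.05], 8: [-1.55, -1.75]}, truth),
        "emax": amse({0: [0.2, -0.2], 8: [-1.35, -1.95]}, truth),
    }
    print(smse(amse(selected, truth), single_models))
```

A value of 1 would mean that the procedure is as accurate as always fitting the single model that happens to be best in that scenario; larger values quantify the loss relative to that benchmark.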

Moreover, we consider model selection procedures to estimate the target dose
$td_\delta(\theta_{\mathcal{M}})$ achieving an effect of $\delta = 1.3$ over placebo for a given dose response model
$\mathcal{M}$. Similarly as above, we then define the MSE as

    $MSE_{td}(S) = \frac{1}{N_{sim}} \sum_{j=1}^{N_{sim}} \left( td_\delta(\hat\theta_{I,j,S}) - td_\delta(\theta_S) \right)^2$.    (4.3)

Note that those simulation runs, where the estimated target dose is not contained within
the dose range, are excluded from the MSE calculation. The standardization of the MSE
is again achieved by dividing (4.3) by

    $MMSE_{td}(S) = \min_{\mathcal{M}} \frac{1}{N_{sim}} \sum_{j=1}^{N_{sim}} \left( td_\delta(\hat\theta^{(j)}_{\mathcal{M}}) - td_\delta(\theta_S) \right)^2$,

where the minimum is calculated with respect to all models $\eta_{\mathcal{M}}$ under consideration.
The standardized MSE (SMSE) of scenario $S$ for the target dose is then
given by

    $SMSE_{td}(S) = \frac{MSE_{td}(S)}{MMSE_{td}(S)}$    (4.4)

and the averaged standardized mean squared error for the target dose (ASMSE$_{td}$) is again
obtained by averaging the SMSEs over all scenarios.

Note that the estimator of the target dose is calculated by interpolation if the ANOVA
model is selected by the model selection criterion $I$.

The model averaging estimators $\hat\mu_I(d)$ for the dose effect at dose $d$ and for the target
dose are obtained from (3.8) and (3.9), where the parameter $\mu$ is given by $\eta(d, \theta)$ and
$td_\delta(\theta)$, respectively. The definition of the weights in the model averaging procedure is
slightly modified if the target dose estimator of a model lies outside the dose range. In this
case the estimator is not used and the model averaging estimator is calculated from the
weights of the remaining models if their weights sum up to a value greater than 20%.
Otherwise this case is excluded. For bootstrap model averaging a similar approach is
used: if more than 80% of the target dose estimators lie outside the
design space for a given bootstrap run, it is excluded from the calculation.
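
The modified weighting rule for the target dose can be sketched as follows. The text states that the estimator is computed from the weights of the remaining models when these sum to more than 20%; that the remaining weights are rescaled to sum to one is an assumption of this sketch, as are the function name and the use of `None` to mark an excluded case.

```python
def averaged_target_dose(td_estimates, weights, dose_range, min_weight=0.2):
    """Model-averaged target dose using only estimates inside the dose range.

    Returns None (case excluded) if the weights of the remaining models
    do not sum to more than `min_weight`.
    """
    lower, upper = dose_range
    kept = [(td, w) for td, w in zip(td_estimates, weights)
            if td is not None and lower <= td <= upper]
    remaining = sum(w for _, w in kept)
    if remaining <= min_weight:
        return None
    # Assumption: rescale the remaining weights so that they sum to one.
    return sum(td * w for td, w in kept) / remaining

if __name__ == "__main__":
    # Hypothetical estimates; the second model's estimate lies outside [0, 8].
    print(averaged_target_dose([2.0, 10.0, 4.0], [0.5, 0.3, 0.2], (0.0, 8.0)))
    print(averaged_target_dose([2.0, 10.0], [0.1, 0.9], (0.0, 8.0)))
```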
Criterion       AIC    BIC    BIC$_2$  TIC    AICc
ASMSE(A)        1.35   1.54   1.41     1.35   1.51
ASMSE(B)        1.38   1.56   1.44     1.38   1.55
ASMSE(C)        1.35   1.46   1.38     1.35   1.48
ASMSE(D)        1.37   1.58   1.44     1.37   1.55
ASMSE$_{td}$    1.70   2.35   1.92     1.71   2.74


Table 4.1: The averaged standardized mean squared errors (ASMSE, cf. (4.2)) for the
designs A, B, C and D and for the target dose under the consideration of different model
selection criteria.


criterion   AIC    BIC    BIC2   TIC    AICc   AIC-Boot   BIC-Boot
ASMSE(A)    1.24   1.39   1.28   1.24   1.26   1.21       1.29
ASMSE(B)    1.26   1.40   1.30   1.26   1.28   1.24       1.30
ASMSE(C)    1.23   1.33   1.25   1.24   1.25   1.21       1.24
ASMSE(D)    1.25   1.42   1.30   1.25   1.27   1.23       1.31
ASMSE_td    1.54   1.88   1.62   1.54   1.90   1.30       1.44


Table 4.2: The averaged standardized mean squared errors (ASMSE, cf. (4.2)) for the
designs A, B, C and D and for the target dose with respect to model averaging and
bootstrap model averaging. (The best values per row are printed in bold.)


4.3. Simulation Results 

In Section 4.3.1 and Section 4.3.2 we present the simulation results corresponding to the
candidate set consisting of the linear, quadratic, Emax and Sigmoid Emax models. In Section 4.3.3
we analyze how the performance of the model selection criteria and model averaging
methods changes if the ANOVA model is added to the candidate set.

4.3.1. Results based on the candidate models 1-4 in Table 2.1

First, we consider the case where the ANOVA model is not among the candidate models
used for analysis. In Tables 4.1 and 4.2 we display the ASMSEs defined in (4.2) for the
designs A, B, C, D and for the target dose.

The AIC and the TIC perform similarly and have the best average performance across 
all scenarios both for model selection and model averaging. Comparing model selection 
and model averaging it can be observed that the model averaging procedures generally 
perform slightly better on average than model selection. 

To get an idea of the performance in each individual scenario we ranked the model
selection and model averaging approaches for each scenario according to their performance
and display the ranks in Figure 4.1. One can clearly see that almost all criteria
perform best in some scenarios and worst in other scenarios, so no clear best criterion
can be identified. It is interesting to observe, however, that for the BIC the performance
[Figure 4.1. Legend: Rank 1 to Rank 7. Criteria shown include AICc, AIC-Boot and BIC-Boot.]

Figure 4.1: The distribution of the ranks over all scenarios for the SMSE of dose effect 
in design S = C. Left: model selection, right: model averaging methods. 


is either very good or very bad, while for the AIC or TIC the performance is more balanced
across all scenarios. The mixed performance of the BIC is due to the fact that it
penalizes the complexity of a model more strongly. Consequently, it prefers the linear
model because of its smaller number of parameters, even if it is not an adequate model.
However, in situations when the linear model is the true one, the BIC performs best.

In terms of probabilities to select the true model, we observe that the AIC performs
best with respect to this criterion; see Figures B.1, B.2 and B.3 in Appendix B for the
detailed results. The probabilities to select the true model, averaged over all scenarios,
are 43%, 34%, 39%, 42% and 23% for the AIC, BIC, BIC2, TIC and AICc,
respectively, which also shows some advantages for model selection based on the AIC.
Summarizing, with this set of candidate models (linear, quadratic, Emax and Sigmoid
Emax), the AIC-based estimators AIC and TIC (both for model selection and for
model averaging) outperform those based on the other criteria.

4.3.2. Model Selection vs. Model Averaging 

In this section we compare the results of model averaging with those of model selection
in more detail. In terms of the average performance, we observed that model averaging
outperforms model selection (see Tables 4.1 and 4.2). We now investigate the individual
results for each scenario; see the left plots in Figures 4.2 and 4.3, which correspond to
the SMSE of the dose effect in (4.1) and the target dose in (4.4). The dashed line
in the left panel of Figure 4.2 displays the situations where the SMSE of the model
selection based estimator and the SMSE of the model averaging estimator are equal. The
[Figure 4.2. Legend (one plot symbol per criterion): AIC, AICc, BIC, BIC2, TIC. Axes: MSE1 and MSE2.]


Figure 4.2: Comparison of model selection, model averaging and bootstrap for estimating
the dose effects in design C. The figure shows the SMSE values. Left panel: model
selection (MSE1) versus model averaging (MSE2). Right panel: model averaging (MSE1)
versus bootstrap model averaging (MSE2).


points below (above) the diagonal correspond to scenarios where the model averaging
(selection) estimators have a smaller SMSE. For example, SMSE(C, S) = 1.73 for BIC
model selection, but 1.48 for BIC model averaging in the Emax scenario S with sample
size 250 under design B, indicating that the BIC model averaging estimator is more
precise than the model selection estimator in this scenario.

One observes that across all scenarios model averaging tends to outperform model
selection consistently, resulting in smaller SMSE, even though the differences are never
substantial. The ratio of the SMSE(C, S) for BIC model selection and the SMSE(C, S)
for BIC model averaging is given by 1.17, which means that on average 17% more observations
are needed for model selection in order to result in an estimator of similar
precision as obtained with the corresponding model averaging approach. Note that the
individual improvement obtained by model averaging depends on the selection criterion.
The improvement by model averaging with respect to target dose estimation
is even more substantial (see the left panel in Figure 4.3).
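The sample size interpretation of the ratio rests on the usual heuristic that the MSE of an estimator decreases roughly in proportion to $1/N$; this proportionality is an assumption stated here for illustration, not a result derived in the text. The two SMSE values quoted above for the Emax scenario (1.73 and 1.48) are standardized by the same minimum MSE, so their ratio equals the ratio of the underlying MSEs, and they reproduce the reported factor:

```latex
\frac{\mathrm{SMSE}_{\text{selection}}}{\mathrm{SMSE}_{\text{averaging}}}
  = \frac{1.73}{1.48} \approx 1.17,
\qquad
\frac{N_{\text{selection}}}{N_{\text{averaging}}} \approx 1.17
\quad \text{if } \mathrm{MSE} \propto \frac{1}{N}.
```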

In the right panels of Figure 4.2 and of Figure 4.3 we compare the model averaging
using bootstrap with that based on the AIC and the BIC weights. We observe that
the bootstrap estimators yield slightly better results than the model averaging
estimators except for the linear scenarios. For example, the SMSE(C, S) belonging to
BIC model averaging is equal to 1.48 whereas the corresponding SMSE(C, S) of the BIC
bootstrap estimator is smaller (SMSE(C, S) = 1.29) in the Emax scenario with sample
size 250, design B (see red line in the right panel of Figure 4.2).

The reason why the performance is worse for the linear model is that in general, more 
complex models are preferred by bootstrap model averaging (especially when using the 
[Figure 4.3. Legend (one plot symbol per criterion): AIC, AICc, BIC, BIC2, TIC.]


Figure 4.3: Comparison of model selection, model averaging and bootstrap model averaging
for estimating the target dose. The figure shows the SMSE. Left panel: model
selection (MSE1) versus model averaging (MSE2). Right panel: weights based model
averaging (MSE1) versus bootstrap model averaging (MSE2).


AIC), which implies a lower selection probability for the linear model. This behavior
improves the performance of the bootstrap estimators in the nonlinear scenarios, whereas
it worsens their performance in the linear scenarios.

4.3.3. Simulation Results based on the candidate models 1-5 in Table 2.1

From a practical point of view adding the ANOVA model to the set of candidate models 
can be considered helpful to safeguard against unexpected shapes, as the ANOVA model 
is extremely flexible. 

To compare the different criteria for model selection and model averaging, we calculated
the same metrics as in the last section. In this case the superiority of the AIC
and TIC cannot be observed anymore (see Figure 4.4). The model selection criteria
perform more similarly compared to each other (see Tables C.1 and C.2 in Appendix C).
In general, however, model averaging still outperforms model selection on average; the
only exception is bootstrap model averaging based on the AIC. The ANOVA
approach represents a rather complex model (one parameter per dose) and it seems that
the AIC does not penalize this complexity strongly enough, thus leading to an inferior
performance. The BIC is not affected similarly since it uses a higher penalty.
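To make the difference in penalties concrete, the following Python sketch compares the textbook AIC penalty (two units per parameter) with the textbook BIC penalty (log N units per parameter). The criteria actually used in this paper are defined in Table 3.1 and may be scaled differently; the parameter counts below, in particular six dose groups for the ANOVA model, are hypothetical values chosen for illustration.

```python
import math


def aic_penalty(n_params):
    """Textbook AIC penalty: two units per parameter."""
    return 2 * n_params


def bic_penalty(n_params, n_obs):
    """Textbook BIC penalty: log(N) units per parameter."""
    return math.log(n_obs) * n_params


# Hypothetical parameter counts for the mean structure; the ANOVA count
# assumes six dose groups (one parameter per dose).
n_params = {"linear": 2, "Emax": 3, "SigEmax": 4, "ANOVA": 6}
n_obs = 250  # one of the total sample sizes mentioned in the text

# Extra penalty relative to the linear model under both criteria.
extra_cost = {
    name: (
        aic_penalty(p) - aic_penalty(n_params["linear"]),
        bic_penalty(p, n_obs) - bic_penalty(n_params["linear"], n_obs),
    )
    for name, p in n_params.items()
}

for name, (aic_cost, bic_cost) in extra_cost.items():
    print(f"{name:8s} extra AIC penalty {aic_cost:5.1f}, extra BIC penalty {bic_cost:5.1f}")
```

Under these assumptions the BIC charges more than twice as much as the AIC for the additional ANOVA parameters, which is in line with the observation that the AIC picks the ANOVA model more readily.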

Considering the direct comparison of both candidate sets (namely the one with ANOVA
and the one without ANOVA) using the SMSE (see Figure 4.5, left panel), the criteria
mostly perform better if the ANOVA model is not among the candidate models.

Model averaging estimators also perform better if the ANOVA model is not among
the candidate models. For bootstrap model averaging (see right panel in Figure 4.5) one
[Figure 4.4. Legend: Rank 1 to Rank 5. Criteria shown include AICc, AIC-Boot and BIC-Boot.]



Figure 4.4: The distribution of the ranks over all scenarios for the metric SMSE dose
effect in design C (left: model selection, right: model averaging). The ANOVA model is
among the candidate models.


[Figure 4.5. Legend (one plot symbol per criterion): AIC, AICc, BIC, BIC2, TIC. Axis label: MSE with ANOVA.]



Figure 4.5: Comparison of the SMSE of dose effect estimators in design C with and 
without the ANOVA model (left panel: model selection, center panel: weight based model 
averaging, right panel: bootstrap model averaging). 


can clearly see that the AIC with the ANOVA candidate model gets much worse, while
the BIC is not affected.

Summarizing, the performance of the model selection criteria depends sensitively on
the candidate model set. Including the ANOVA model does not improve the performance
of all criteria; it sometimes even deteriorates the performance. Of course this is due to
the fact that the dose response shape for the ANOVA model considered here can be
candidate   AIC                          BIC
models      values  weights  Bootstrap   values  weights  Bootstrap
Linear      52.17   15%      26%         41.48   58%      64%
Quadratic   53.50   30%      36%         39.24   19%      17%
Emax        53.84   36%      30%         39.58   22%      19%
SigEmax     51.85   13%      0%          34.03   1%       0%
ANOVA       50.14   6%       8%          28.75   0%       0%

candidate   BIC2              TIC               AICc
models      values  weights   values  weights   values  weights
Linear      47.00   34%       52.40   16%       52.09   16%
Quadratic   46.59   28%       53.73   30%       53.36   30%
Emax        46.93   33%       54.05   35%       53.70   36%
SigEmax     43.21   5%        52.07   13%       51.64   13%
ANOVA       39.78   0%        50.36   6%        49.85   5%


Table 5.1: The different values of the selection criteria, the corresponding model averaging
weights (in %) and the relative frequency (in %) of the AIC and BIC bootstrap in the
COPD case study.


approximated roughly by the other candidate models. If more extreme, irregular shapes 
were used, inclusion of the ANOVA model could improve performance. 


5. COPD Case Study Revisited 

Taking into account the results from Sections 3 and 4, we now return to the COPD
case study and the three questions posed in Section 2, which were (i) which of the
candidate models should be used for the dose response modeling step, (ii) whether
model selection or averaging should be used, and (iii) which specific information criteria
should be employed to perform either model selection or averaging.

All dose response models introduced in Table 2.1 were fitted to the COPD data from
Section 2. The model fits are displayed in Figure D.1 in Appendix D. Visually all
model fits are adequate, perhaps with the exception of the linear model, which seems to
overestimate the placebo response.

When observing the results for the different information criteria in Table 5.1 one
can see that the AIC-type criteria are rather consistent among each other and favor
the Emax model with 36%, the quadratic model with 30%, followed by the Sigmoid Emax
and the linear model with roughly 15% each and the ANOVA model with 6% in terms
of model weights. The BIC-related criteria give more weight to the linear model (as
already observed in the simulations), as they penalize the number of parameters more
strongly. The BIC penalizes here considerably more strongly than the BIC2, giving
58% weight to the linear model, while the BIC2 gives around 30% to each of the linear,
Figure 5.1: The fitted models after model selection with respect to different criteria.


quadratic and Emax models. The fitted curves based on model selection and model
averaging for all approaches are displayed in Figure 5.1. It can be seen that for most
methods the difference between model selection and model averaging is not very large
in this example, because the models that accumulate model weights lead to relatively
similar fits. For the BIC and the BIC2, however, a substantial difference can be observed
between model averaging and selection: the linear model gets selected, but the Emax
and quadratic models have almost equally large model weights. Model averaging seems
particularly important in these situations, to adequately reflect the uncertainty in the
modeling process.
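The relation between the criterion values and the model weights in Table 5.1 can be checked numerically. The sketch below assumes that the weights are proportional to exp(value / 2), with larger criterion values being better; this form is inferred from the tabulated numbers rather than quoted from the defining equation (3.8), so it should be read as an assumption. The function name is hypothetical.

```python
import math


def ic_weights(values):
    """Weights proportional to exp(value / 2) for 'larger is better' criteria.

    Subtracting the maximum first avoids overflow and does not change
    the normalized weights.
    """
    best = max(values)
    raw = [math.exp((v - best) / 2) for v in values]
    total = sum(raw)
    return [r / total for r in raw]


models = ["Linear", "Quadratic", "Emax", "SigEmax", "ANOVA"]
aic_values = [52.17, 53.50, 53.84, 51.85, 50.14]  # AIC column of Table 5.1
bic_values = [41.48, 39.24, 39.58, 34.03, 28.75]  # BIC column of Table 5.1

aic_pct = [round(100 * w) for w in ic_weights(aic_values)]
bic_pct = [round(100 * w) for w in ic_weights(bic_values)]

for name, a, b in zip(models, aic_pct, bic_pct):
    print(f"{name:10s} AIC weight {a:3d}%   BIC weight {b:3d}%")
```

Under this assumption the rounded percentages agree with the AIC and BIC weight columns of Table 5.1, which also shows how a difference of about two units in the criterion value moves a substantial share of the weight between models.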

Regarding questions (i) and (ii): the simulations in the last section showed a consistent
benefit of model averaging over selection, so the proposal would be to use model
averaging here. There does not seem to be a major difference between the weight-based
and bootstrap model averaging in this particular example (see Table 5.1 and Figure
5.1). Regarding question (iii), in the simulations for a similar candidate set of models
(which did not include ANOVA) the BIC showed a slightly worse behavior than the
AIC, suggesting the use of the AIC-related criteria in this scenario.

The two main objectives of the study were to evaluate the shape of the dose-response
curve and to identify doses achieving an effect of 0.1 to 0.14 liters on top of placebo. In our
situation we will therefore use AIC-based bootstrap model averaging to answer these
questions. The placebo effect of the curve was estimated as 1.25 (95% CI: [1.18, 1.31]).
Within the observed dose range the curve increases monotonically up to an effect of 0.14 (95% CI:
[0.06, 0.22]) at the maximum dose of 100 mg. At the 50 mg dose an effect of 0.13 (95% CI: [0.04, 0.20])
is estimated and 91.22% (95% CI: [50.00%, 164.21%]) of the maximum effect is achieved, indicating
that a plateau-like level is reached there. The increasing part of the curve is between 0
and 50 mg.

6. Conclusions 

This paper compared different existing methods for model selection and model averaging 
in terms of their mathematical properties and their performance for dose-response curve 
estimation in a large scale simulation study. 

In terms of their mathematical properties, it was reviewed and illustrated that the
BIC-type criteria are consistent, while the AIC-type criteria asymptotically tend to
prefer too complex models (see Figures 3.1 and 3.2). It was also shown that
BIC-type criteria select different models for different total sample sizes, even when the
estimated dose-response curves and the uncertainty around each dose-response curve
(i.e. the width of the confidence intervals around the curve) are the same. This is different from
other situations in clinical trial design, where only the standard error is important, not
the total sample size by itself, and it is an important point to take into account at the trial
design stage when using BIC-type criteria.

In terms of the simulation results we considered two candidate sets of models, one that did
not include an ANOVA model and one that did. In the first situation
AIC-type criteria overall performed slightly better than the BIC, which penalized the
more complex models too strongly for most situations. However, when allowing the
ANOVA model to be selected as well, it turned out that the AIC selected it too often
in some situations, leading to a decreased performance. Over all scenarios,
models and model selection criteria there seemed to be little value in adding an ANOVA
based model to the set of candidate models, for the scenarios we evaluated. Approaches
that selected this model more often (like the AIC-type criteria) decreased in performance
most, compared to the candidate set without the ANOVA model.

The most general observation from the simulations is, however, that the model averaging
methods outperformed the corresponding model selection methods. Even though the
benefit is typically not large, it is consistent across the considered candidate sets, models,
designs, total sample sizes, performance metrics and methods. In terms of which model
averaging method to use (weights based or bootstrap) no clear message emerges. An
advantage of the bootstrap model averaging method from a pragmatic perspective is
that confidence intervals that take into account model uncertainty are straightforward
to obtain from the generated bootstrap samples.
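How such intervals fall out of the bootstrap samples can be sketched with NumPy. This is a simplified illustration, not the procedure used in the paper: the candidate set consists only of a linear and a quadratic polynomial, the AIC is written in its Gaussian least-squares form (smaller is better), resampling is done within dose groups, and the point estimate is the mean of the bootstrap predictions. All function names, the simulated data and these design choices are assumptions.

```python
import numpy as np


def fit_best_polynomial(dose, resp, degrees=(1, 2)):
    """Fit polynomial candidates by least squares and return the AIC-best fit."""
    n = len(resp)
    best_aic, best_coef = None, None
    for deg in degrees:
        coef = np.polyfit(dose, resp, deg)
        rss = float(np.sum((resp - np.polyval(coef, dose)) ** 2))
        n_par = deg + 2  # polynomial coefficients plus the residual variance
        aic = n * np.log(rss / n) + 2 * n_par  # Gaussian AIC up to a constant
        if best_aic is None or aic < best_aic:
            best_aic, best_coef = aic, coef
    return best_coef


def bootstrap_ma_interval(dose, resp, d_new, n_boot=500, level=0.95, seed=0):
    """Bootstrap model averaging estimate and percentile interval at d_new."""
    rng = np.random.default_rng(seed)
    dose = np.asarray(dose, dtype=float)
    resp = np.asarray(resp, dtype=float)
    # Resample within each dose group so that the design stays fixed.
    groups = [np.flatnonzero(dose == d) for d in np.unique(dose)]
    preds = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate(
            [rng.choice(g, size=g.size, replace=True) for g in groups]
        )
        coef = fit_best_polynomial(dose[idx], resp[idx])  # select per resample
        preds[b] = np.polyval(coef, d_new)
    alpha = 1.0 - level
    lower, upper = np.quantile(preds, [alpha / 2, 1 - alpha / 2])
    return float(preds.mean()), (float(lower), float(upper))


# Simulated example data (hypothetical doses and effect sizes).
data_rng = np.random.default_rng(1)
dose = np.repeat([0.0, 12.5, 25.0, 50.0, 100.0], 20)
resp = 1.25 + 0.14 * dose / (dose + 15.0) + data_rng.normal(0.0, 0.1, dose.size)

estimate, (ci_lower, ci_upper) = bootstrap_ma_interval(dose, resp, d_new=50.0, n_boot=200)
```

Because the model is reselected in every bootstrap sample, the spread of the predictions reflects both parameter uncertainty and model uncertainty, which is the pragmatic advantage mentioned above.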
A. Appendix: Background on model selection criteria based on the AIC

As in Section 3 let $\hat\theta_\ell$ denote the ML estimator in model $\mathcal{M}_\ell$ ($\ell = 1, \ldots, L$). As shown by
White (1982) this estimator converges in probability to the KL-divergence minimizing
parameter $\theta_\ell^*$ under certain regularity conditions.
This gives the estimated KL-divergences

$$\mathrm{KL}(p_\ell, g) = \sum_{i=1}^{k} \frac{n_i}{N} \int \log\left(\frac{g(y|d_i)}{p_\ell(y|d_i, \hat\theta_\ell)}\right) g(y|d_i)\, dy, \qquad \ell = 1, \ldots, L.$$

Note that this term is a random variable with expected value

$$\sum_{i=1}^{k} \frac{n_i}{N} \left( \int g(y|d_i) \log(g(y|d_i))\, dy - E\left[\int g(y|d_i) \log p_\ell(y|d_i, \hat\theta_\ell)\, dy\right] \right),$$

because here the estimator $\hat\theta_\ell$ is considered as fixed. Both the first and the second term
within the sum depend on the true density $g(y|d)$, whereas only the second one depends
on the considered model $\mathcal{M}_\ell$ and its estimator $\hat\theta_\ell$. Thus, we only need to estimate the
term

$$Q_N(\mathcal{M}_\ell) := E[R_N] := E\left[\sum_{i=1}^{k} \frac{n_i}{N} \int g(y|d_i) \log p_\ell(y|d_i, \hat\theta_\ell)\, dy\right]$$

in order to distinguish the quality of approximations.


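For the estimation of $Q_N(\mathcal{M}_\ell)$ we replace the expected value and integral by the mean
depending on the observations. Thus, an estimator for $Q_N(\mathcal{M}_\ell)$ is given by

$$\hat{Q}_N(\mathcal{M}_\ell) = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \log p_\ell(Y_{ij}|d_i, \hat\theta_\ell) = \frac{1}{N} \max_{\theta_\ell} \log(\mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell)),$$

where $\mathcal{L}_N(\mathcal{M}_\ell, \theta_\ell)$ is the likelihood function of model $\mathcal{M}_\ell$ evaluated at the parameter
$\theta_\ell$. In principle a model could be chosen from $\mathcal{M}_1, \ldots, \mathcal{M}_L$ which leads to the largest
value of $\hat{Q}_N(\mathcal{M}_\ell)$ ($\ell = 1, \ldots, L$). However, this naive estimator usually chooses the model
with the largest number of parameters, which often leads to an overfit of the data. This
property is a consequence of the fact that the log likelihood function is an increasing
function of the dimension $d_{\theta_\ell}$ of the parameter $\theta_\ell$. It is even possible to calculate the
approximate bias (see Claeskens and Hjort (2008)) as

$$E[\hat{Q}_N(\mathcal{M}_\ell)] - Q_N(\mathcal{M}_\ell) \approx \frac{\mathrm{pen}_\ell^*}{N},$$

where $\mathrm{pen}_\ell^* = \mathrm{tr}(K(\mathcal{M}_\ell) J^{-1}(\mathcal{M}_\ell))$,

$$K(\mathcal{M}_\ell) = \sum_{i=1}^{k} \frac{n_i}{N} E\left[\frac{\partial \log p_\ell(Y|d_i, \theta_\ell)}{\partial \theta_\ell} \left(\frac{\partial \log p_\ell(Y|d_i, \theta_\ell)}{\partial \theta_\ell}\right)^T\right]$$

denotes the Fisher information matrix and the matrix

$$J^{-1}(\mathcal{M}_\ell) = -\left(\sum_{i=1}^{k} \frac{n_i}{N} E\left[\frac{\partial^2 \log p_\ell(Y|d_i, \theta_\ell)}{\partial^2 \theta_\ell}\right]\right)^{-1}$$

the negative inverse of the expectation of the second derivative of the log likelihood
function. If the considered density $p_\ell$ and the true density $g$ coincide (i.e. model $p_\ell$ is
the true one) and certain regularity conditions (cf. White (1982)) are fulfilled, we have
$K(\mathcal{M}_\ell) = J(\mathcal{M}_\ell)$ and consequently $\mathrm{pen}_\ell^* = d_{\theta_\ell}$, where $d_{\theta_\ell}$ denotes the dimension of
the parameter $\theta_\ell$.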

In conclusion, a bias corrected estimator for the second part of the KL-divergence
is given by (3.5). As outlined in Section 3.1 the AIC-based criteria from Table 3.1
are based on this estimator using different estimators for the penalty term
$\mathrm{pen}_\ell^*$. Note that for the TIC the two matrices $K$ and $J^{-1}$, and thus $\mathrm{pen}_\ell^*$, are
explicitly estimated by

$$\hat{K}(\mathcal{M}_\ell) = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \frac{1}{n_i} \frac{\partial \log p_\ell(y_{ij}|d_i, \hat\theta_\ell)}{\partial \theta_\ell} \left(\frac{\partial \log p_\ell(y_{ij}|d_i, \hat\theta_\ell)}{\partial \theta_\ell}\right)^T$$

and

$$\hat{J}(\mathcal{M}_\ell) = -\sum_{i=1}^{k} \sum_{j=1}^{n_i} \frac{1}{n_i} \frac{\partial^2 \log p_\ell(y_{ij}|d_i, \hat\theta_\ell)}{\partial^2 \theta_\ell},$$

respectively. The resulting penalty term is therefore given by $\mathrm{tr}(\hat{J}^{-1}(\mathcal{M}_\ell)\hat{K}(\mathcal{M}_\ell))$
(Takeuchi (1976)).
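The behavior of the trace penalty can be illustrated numerically in a deliberately simplified setting. The sketch below is not the dose response setting of the paper: it assumes a single group of i.i.d. observations modeled as normal with unknown mean and variance, for which the score vectors and second derivatives are available in closed form. The function name and the simulated data are hypothetical.

```python
import numpy as np


def tic_penalty_normal(y):
    """Estimate tr(J^{-1} K) for an i.i.d. normal model (mean, variance)."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    r = y - mu
    s2 = np.mean(r**2)  # ML estimate of the variance
    # Per-observation score vectors: derivatives of the log density
    # with respect to the mean and the variance, at the ML estimate.
    scores = np.column_stack([r / s2, -0.5 / s2 + 0.5 * r**2 / s2**2])
    K = scores.T @ scores / len(y)
    # Negative average matrix of second derivatives at the ML estimate.
    off_diag = np.mean(r) / s2**2
    J = np.array([
        [1.0 / s2, off_diag],
        [off_diag, -0.5 / s2**2 + np.mean(r**2) / s2**3],
    ])
    return float(np.trace(np.linalg.solve(J, K)))


rng = np.random.default_rng(0)
pen_normal = tic_penalty_normal(rng.normal(size=20000))          # model is correct
pen_uniform = tic_penalty_normal(rng.uniform(-1, 1, size=20000))  # model is wrong
```

When the data are in fact normal, K and J nearly coincide and the estimated penalty is close to the number of parameters (two), which corresponds to the AIC penalty. For the uniform data the normal model is misspecified and the trace moves away from two, which is the situation the TIC is designed to account for.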
B. Selection Probabilities


In this section the probabilities that a selection criterion chooses a response model given
a specific scenario are displayed in the case where the candidate models are given by the
linear, the quadratic, the Emax and the Sigmoid Emax model.
Figure B.1: The probability that the AIC (left) and the AICc (right) choose a response
model given a scenario.

Figure B.2: The probability that the BIC (left) and the BIC2 (right) choose a response
model given a scenario.

[Legend: P(lin | ...), P(emax | ...), P(quad | ...), P(sigEmax | ...)]

Figure B.3: The probability that the TIC chooses a response model given a scenario.

C. Tables for Simulation Study with ANOVA 



              AIC    BIC    BIC2   TIC    AICc
Probability   41%    34%    39%    41%    22%
ASMSE(A)      1.52   1.55   1.45   1.52   1.57
ASMSE(B)      1.57   1.57   1.47   1.56   1.60
ASMSE(C)      1.50   1.47   1.40   1.50   1.53
ASMSE(D)      1.51   1.59   1.47   1.51   1.59
ASMSE_td      1.73   2.43   1.96   1.74   2.74


Table C.1: The averages of the standardized mean squared errors taken over all scenarios
for model selection. ANOVA is among the candidate models.



            AIC    BIC    BIC2   TIC    AICc   AIC-Boot   BIC-Boot
ASMSE(A)    1.36   1.39   1.31   1.36   1.29   1.56       1.30
ASMSE(B)    1.39   1.40   1.32   1.39   1.30   1.63       1.31
ASMSE(C)    1.34   1.33   1.27   1.34   1.28   1.54       1.25
ASMSE(D)    1.36   1.42   1.32   1.36   1.30   1.53       1.32
ASMSE_td    1.55   1.94   1.64   1.55   1.87   1.28       1.43


Table C.2: The averages of the standardized mean squared errors taken over all scenarios
for model averaging and bootstrap model averaging. ANOVA is among the candidate
models.
Figure C.1: Comparison of model selection, model averaging and bootstrap for estimating
the dose effects in design C. The figure shows the SMSE values. Left panel: model
selection versus model averaging. Right panel: model averaging versus bootstrap model
averaging. ANOVA is among the candidate models.

Figure C.2: Comparison of model selection, model averaging and bootstrap model averaging
for estimating the target dose. The figure shows the SMSE. Left panel: model
selection versus model averaging. Right panel: weights based model averaging versus
bootstrap model averaging. ANOVA is among the candidate models.